---
title: "BAM Phase 2: Comparative Cluster Analysis, Model Selection, and Validation"
subtitle: "Bidirectional Acculturation Model — GMM k=4 Final Solution"
author: "Jessala A. Grijalva"
date: "`r format(Sys.Date(), '%B %d, %Y')`"
format:
  pdf:
    toc: true
    toc-depth: 3
    number-sections: true
    geometry: margin=1in
    colorlinks: true
    keep-tex: false
execute:
  echo: false
  warning: false
  message: false
  error: false
---

```{r setup}
#| label: setup

library(here)
library(pacman)
p_load(
  dplyr, tidyr, purrr, ggplot2, tibble, forcats,
  mclust, cluster, e1071, dbscan
)

knitr::opts_chunk$set(fig.align = "center")

# Global seed and analysis contract
SEED <- 2500L
VARS5 <- c("AMERICAN", "CULTURAL_IDENTITY", "KEEPSPAN", "DISTINCT", "LEARNENG")

# Political outcome variables (not used in clustering)
POL_VARS <- c("IDEOLOGY", "PARTYID", "IMMVIEW", "DREAMACT", "IMMPOLICY")
```

\newpage

# Overview

This document is a self-contained, reproducible record of the full Phase 2
analysis for the Bidirectional Acculturation Model (BAM). It supports two
manuscripts:

1. **"Myth of the Zero-Sum"** — tests the binary acculturation assumption
   against a bidirectional alternative.
2. **"Making Cluster Analysis Inferential"** — introduces the Inferential
   Cluster Analysis (ICA) framework and demonstrates it on the same data.

The analysis proceeds in four parts:

- **Part I**: Load preprocessed data and construct the clustering matrix.
- **Part II**: Comparative cluster analysis across five algorithms at k=3, 4, 5.
- **Part III**: Final GMM k=4 model, orientation labeling, and cluster profiling.
- **Part IV**: Validation — bootstrap stability (B=1,000), bootstrap confidence
  intervals on cluster means (B=400), and 5-fold cross-validation.

The only file written to disk is the final clustered dataset:
`data/processed/bam_clustered_gmm_k4.rda`.

---

# Part I — Data and Clustering Matrix

## Load preprocessed data

The canonical preprocessed dataset from Phase 1 is loaded. The object
`lns_subset` contains 4,785 eligible Latino voters from the 2006 LNS with
all missing values imputed via MICE-PMM. See `Phase_1_Preprocessing.qmd` for
full documentation.

```{r load-data}
#| label: load-data

# Try clean_data.rda first (current name), fall back to lns_clean.rda
rda_candidates <- c(
  here("data", "processed", "clean_data.rda"),
  here("data", "processed", "lns_clean.rda")
)

loaded <- FALSE
for (rda_path in rda_candidates) {
  if (file.exists(rda_path)) {
    load(rda_path)
    loaded <- TRUE
    cat("Loaded:", rda_path, "\n")
    break
  }
}
if (!loaded) stop("Could not find preprocessed data in data/processed/")

cat("Dimensions:", nrow(lns_subset), "x", ncol(lns_subset), "\n")
```

## Confirm VARS5 and construct the scaled clustering matrix

The five clustering variables (VARS5) are:

| Variable | Description | Scale |
|----------|-------------|-------|
| AMERICAN | Strength of American identity | 1–4 |
| CULTURAL_IDENTITY | Mean of regional + Latino identity | 1–4 |
| KEEPSPAN | Importance of keeping Spanish | 1–4 |
| DISTINCT | Importance of maintaining distinct culture | 1–3 |
| LEARNENG | Importance of learning English | 1–4 |

```{r build-matrix}
#| label: build-matrix

# Verify all VARS5 columns exist
missing_vars <- setdiff(VARS5, names(lns_subset))
if (length(missing_vars) > 0) {
  stop("Missing VARS5 columns: ", paste(missing_vars, collapse = ", "))
}

# Extract numeric matrix and check for completeness
X_raw <- lns_subset %>%
  select(all_of(VARS5)) %>%
  mutate(across(everything(), as.numeric))

complete_idx <- complete.cases(X_raw)
cat("Rows total:", nrow(X_raw), "\n")
cat("Rows complete on VARS5:", sum(complete_idx), "\n")
cat("Distinct points:", nrow(unique(X_raw[complete_idx, ])), "\n")

# Standardize for clustering
X <- scale(as.matrix(X_raw[complete_idx, ]))
```

\newpage

# Part II — Comparative Cluster Analysis

This section runs five clustering algorithms at k=3, 4, and 5 to evaluate
convergence across methods with fundamentally different mathematical
assumptions. The goal is diagnostic — to confirm that the four-cluster
structure reflects genuine data features rather than artifacts of any single
algorithm.

## Shared evaluation functions

```{r eval-functions}
#| label: eval-functions

# Adjusted Rand Index (robust to NAs)
adj_ari <- function(a, b) {
  m <- !is.na(a) & !is.na(b)
  if (sum(m) == 0) return(NA_real_)
  mclust::adjustedRandIndex(a[m], b[m])
}

# Format cluster sizes as a string
fmt_sizes <- function(cl) {
  cl <- cl[!is.na(cl)]
  if (length(cl) == 0) return(NA_character_)
  paste(sort(as.integer(table(cl))), collapse = ", ")
}

# Silhouette on a subsample for speed
silhouette_subsample <- function(X, cl, m = 1500, seed = 1) {
  set.seed(seed)
  n <- nrow(X)
  idx <- sample.int(n, size = min(m, n))
  Xs <- X[idx, , drop = FALSE]
  cls <- cl[idx]
  keep <- !is.na(cls)
  Xs <- Xs[keep, , drop = FALSE]
  cls <- cls[keep]
  if (length(unique(cls)) < 2) return(NA_real_)
  lab <- as.integer(factor(cls))
  sil <- cluster::silhouette(lab, dist(Xs))
  mean(sil[, "sil_width"], na.rm = TRUE)
}

# Calinski-Harabasz index
ch_index <- function(X, cl) {
  cl <- as.integer(factor(cl))
  k <- length(unique(cl))
  n <- nrow(X)
  if (k < 2 || k >= n) return(NA_real_)

  overall_mean <- colMeans(X)
  BSS <- 0
  WSS <- 0
  for (j in unique(cl)) {
    members <- X[cl == j, , drop = FALSE]
    nj <- nrow(members)
    center <- colMeans(members)
    BSS <- BSS + nj * sum((center - overall_mean)^2)
    WSS <- WSS + sum(scale(members, center = center, scale = FALSE)^2)
  }
  (BSS / (k - 1)) / (WSS / (n - k))
}
```

## Algorithm wrappers

```{r algo-wrappers}
#| label: algo-wrappers

# K-means with safe initialization from distinct points
fit_kmeans <- function(X, k, seed = SEED) {
  set.seed(seed)
  U <- unique(X)
  if (nrow(U) < k) stop("Not enough distinct points for k=", k)
  best <- NULL
  best_wss <- Inf
  for (s in seq_len(50)) {
    idx <- sample.int(nrow(U), k)
    km <- try(stats::kmeans(X, centers = U[idx, , drop = FALSE],
                            iter.max = 100), silent = TRUE)
    if (!inherits(km, "try-error") && km$tot.withinss < best_wss) {
      best_wss <- km$tot.withinss
      best <- km
    }
  }
  if (is.null(best)) stop("All kmeans starts failed.")
  as.integer(best$cluster)
}

# GMM (Mclust, EEV covariance)
fit_gmm <- function(X, k, seed = SEED) {
  set.seed(seed)
  g <- Mclust(X, G = k, modelNames = "EEV", verbose = FALSE)
  if (is.null(g)) return(rep(NA_integer_, nrow(X)))
  as.integer(g$classification)
}

# Fuzzy C-Means
fit_fuzzy <- function(X, k, seed = SEED) {
  set.seed(seed)
  cm <- e1071::cmeans(X, centers = k, m = 2, iter.max = 200,
                      dist = "euclidean", method = "cmeans")
  as.integer(cm$cluster)
}

# Agglomerative hierarchical (Ward's method)
fit_hier <- function(X, k, seed = SEED) {
  d <- dist(X)
  hc <- hclust(d, method = "ward.D2")
  as.integer(cutree(hc, k = k))
}

# DBSCAN (density-based, no k parameter)
fit_dbscan <- function(X, k = NULL, seed = SEED, minPts = 10,
                       kNN_k = 10, eps_quantile = 0.95) {
  set.seed(seed)
  kdist <- dbscan::kNNdist(X, k = kNN_k)
  eps <- as.numeric(quantile(kdist, probs = eps_quantile, na.rm = TRUE))
  db <- dbscan::dbscan(X, eps = eps, minPts = minPts)
  as.integer(db$cluster)
}
```

## Run comparative analysis at k = 3, 4, 5

```{r comparative-run}
#| label: comparative-run

# Methods grid — DBSCAN has no k parameter
methods <- list(
  list(name = "K-Means",      fun = fit_kmeans, takes_k = TRUE),
  list(name = "GMM (EEV)",    fun = fit_gmm,    takes_k = TRUE),
  list(name = "Fuzzy C-Means",fun = fit_fuzzy,  takes_k = TRUE),
  list(name = "Hierarchical",  fun = fit_hier,   takes_k = TRUE),
  list(name = "DBSCAN",       fun = fit_dbscan, takes_k = FALSE)
)

k_values <- c(3L, 4L, 5L)

results <- list()
for (m in methods) {
  for (k in k_values) {
    # DBSCAN ignores k
    cl <- try({
      if (m$takes_k) m$fun(X, k = k) else m$fun(X)
    }, silent = TRUE)

    if (inherits(cl, "try-error") || all(is.na(cl))) {
      results[[length(results) + 1]] <- tibble(
        method = m$name, k = k, status = "failed",
        silhouette = NA_real_, ch = NA_real_, sizes = NA_character_,
        note = if (inherits(cl, "try-error")) as.character(cl) else "All NA"
      )
      next
    }

    # For DBSCAN, k is the number of clusters found (0 = noise label)
    k_actual <- length(unique(cl[cl > 0]))

    results[[length(results) + 1]] <- tibble(
      method = m$name,
      k = if (m$takes_k) k else k_actual,
      status = "ok",
      silhouette = silhouette_subsample(X, cl, m = 1500, seed = 123),
      ch = ch_index(X, cl),
      sizes = fmt_sizes(cl),
      note = ""
    )

    # Only run DBSCAN once (it doesn't take k)
    if (!m$takes_k) break
  }
}

comp_table <- bind_rows(results)
```

## Comparative results table

```{r comparative-table}
#| label: comparative-table

knitr::kable(
  comp_table %>%
    select(method, k, status, silhouette, ch, sizes) %>%
    arrange(method, k),
  digits = 3,
  caption = "Comparative clustering performance across algorithms and k values."
)
```

## ARI agreement with GMM k=4 reference

```{r ari-comparison}
#| label: ari-comparison

# Fit GMM k=4 as the reference solution
set.seed(SEED)
gmm_ref <- Mclust(X, G = 4, modelNames = "EEV", verbose = FALSE)
cl_gmm_ref <- as.integer(gmm_ref$classification)

# Compute ARI for each k=4 solution against GMM k=4
ari_table <- tibble(
  method = c("K-Means", "Fuzzy C-Means", "Hierarchical"),
  ari_vs_gmm = c(
    adj_ari(fit_kmeans(X, k = 4), cl_gmm_ref),
    adj_ari(fit_fuzzy(X, k = 4),  cl_gmm_ref),
    adj_ari(fit_hier(X, k = 4),   cl_gmm_ref)
  )
)

knitr::kable(ari_table, digits = 3,
             caption = "Agreement (ARI) of k=4 solutions with GMM k=4 reference.")
```

```{r ari-barplot}
#| label: ari-barplot
#| fig-cap: "Agreement with GMM k=4 reference solution (ARI)."
#| fig-width: 5
#| fig-height: 3

ggplot(ari_table, aes(x = reorder(method, ari_vs_gmm), y = ari_vs_gmm)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(x = NULL, y = "ARI vs GMM (k=4)") +
  theme_minimal()
```

### Comparative interpretation

The comparative analysis evaluates five algorithms with fundamentally different
mathematical assumptions. K-means assumes spherical, equally sized clusters.
GMM (EEV) allows elliptical clusters with equal volume and shape but varying
orientation. Fuzzy C-means permits probabilistic membership. Hierarchical
clustering builds nested structure from pairwise similarities. DBSCAN identifies
density-based clusters with no distributional assumptions.

The convergence of K-Means, Hierarchical, and GMM on a four-cluster solution
constitutes evidence that the patterns reflect genuine data features. Fuzzy
C-Means and DBSCAN underperform, which is itself informative: the data contain
discrete groupings with meaningful boundaries rather than smooth gradations or
density-based structure.

\newpage

# Part III — Final GMM k=4 Model and Cluster Profiling

## Fit the validated final model

```{r fit-final-gmm}
#| label: fit-final-gmm

set.seed(SEED)
gmm_k4 <- Mclust(X, G = 4, modelNames = "EEV", verbose = FALSE)
cl_k4 <- as.integer(gmm_k4$classification)

cat("Model:", gmm_k4$modelName, "| G =", gmm_k4$G, "\n")
cat("Log-likelihood:", round(gmm_k4$loglik, 2), "\n")
cat("BIC:", round(gmm_k4$bic, 2), "\n")
cat("Cluster sizes:", fmt_sizes(cl_k4), "\n")
```

## Assign acculturation orientation labels

```{r label-orientations}
#| label: label-orientations

# Build output dataframe with cluster assignments
df_out <- lns_subset
df_out$cluster_gmm_k4 <- NA_integer_
df_out$cluster_gmm_k4[complete_idx] <- cl_k4

df_out <- df_out %>%
  mutate(
    orientation = case_when(
      cluster_gmm_k4 == 1 ~ "Culture Affirming",
      cluster_gmm_k4 == 2 ~ "Assimilationist",
      cluster_gmm_k4 == 3 ~ "Demicultural",
      cluster_gmm_k4 == 4 ~ "Bicultural",
      TRUE ~ NA_character_
    ),
    orientation = factor(orientation,
      levels = c("Culture Affirming", "Assimilationist",
                 "Demicultural", "Bicultural")
    )
  )

knitr::kable(
  df_out %>% count(orientation, name = "n") %>% mutate(pct = round(n / sum(n) * 100, 1)),
  caption = "Distribution of acculturation orientations."
)
```

::: {.callout-important}
## Label verification
The `case_when` mapping above must be verified against the cluster means table
below. If the cluster numbering changes (e.g., due to a different R version or
platform), the labels must be re-mapped to match the expected theoretical
profiles. The expected sizes are approximately 433, 705, 378, 3269.
:::

## Table of cluster means on VARS5 (raw scale)

```{r table-means-raw}
#| label: table-means-raw

table_means <- df_out %>%
  filter(!is.na(orientation)) %>%
  group_by(orientation) %>%
  summarise(
    n = n(),
    across(all_of(VARS5), ~ mean(as.numeric(.x), na.rm = TRUE)),
    .groups = "drop"
  )

knitr::kable(table_means, digits = 3,
             caption = "Mean values of VARS5 by acculturation orientation (raw scale).")
```

## Cluster profiles on standardized scale (z-means)

```{r zmeans-heatmap}
#| label: zmeans-heatmap
#| fig-cap: "Cluster profiles on VARS5 (z-score means)."
#| fig-width: 6
#| fig-height: 4

Z <- as.data.frame(X)
Z$orientation <- df_out$orientation[complete_idx]

z_means <- Z %>%
  filter(!is.na(orientation)) %>%
  group_by(orientation) %>%
  summarise(across(all_of(VARS5), ~ mean(.x, na.rm = TRUE)), .groups = "drop") %>%
  pivot_longer(-orientation, names_to = "variable", values_to = "z_mean")

ggplot(z_means, aes(x = variable, y = orientation, fill = z_mean)) +
  geom_tile(color = "white") +
  geom_text(aes(label = round(z_mean, 2)), size = 3) +
  scale_fill_gradient2(low = "steelblue", mid = "white", high = "firebrick",
                       midpoint = 0) +
  labs(x = NULL, y = NULL, fill = "z-mean") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

## PCA visualization

```{r pca-scatter}
#| label: pca-scatter
#| fig-cap: "PCA view of the VARS5 space colored by GMM k=4 orientation."
#| fig-width: 6
#| fig-height: 4.5

pca <- prcomp(X, center = FALSE, scale. = FALSE)
pca_df <- data.frame(
  PC1 = pca$x[, 1],
  PC2 = pca$x[, 2],
  orientation = df_out$orientation[complete_idx]
)

ggplot(pca_df, aes(PC1, PC2, color = orientation)) +
  geom_point(alpha = 0.3, size = 1) +
  labs(color = "Orientation") +
  theme_minimal() +
  theme(legend.position = "bottom")
```

## Political profiles by orientation

```{r political-profiles}
#| label: political-profiles

POL_PRESENT <- intersect(POL_VARS, names(df_out))

if (length(POL_PRESENT) > 0) {
  pol_table <- df_out %>%
    filter(!is.na(orientation)) %>%
    group_by(orientation) %>%
    summarise(
      n = n(),
      across(all_of(POL_PRESENT), ~ mean(as.numeric(.x), na.rm = TRUE)),
      .groups = "drop"
    )

  knitr::kable(pol_table, digits = 3,
               caption = "Mean political attitudes by acculturation orientation.")
}
```

### Binary vs. bidirectional summary

```{r binary-summary}
#| label: binary-summary

n_total <- sum(!is.na(df_out$orientation))
n_binary <- sum(df_out$orientation %in% c("Culture Affirming", "Assimilationist"),
                na.rm = TRUE)
n_hybrid <- sum(df_out$orientation %in% c("Bicultural", "Demicultural"),
                na.rm = TRUE)

cat("Total classified:", n_total, "\n")
cat("Binary orientations (Culture Affirming + Assimilationist):",
    n_binary, sprintf("(%.1f%%)", 100 * n_binary / n_total), "\n")
cat("Hybrid orientations (Bicultural + Demicultural):",
    n_hybrid, sprintf("(%.1f%%)", 100 * n_hybrid / n_total), "\n")
```

\newpage

# Part IV — Validation

Validation addresses whether the GMM k=4 solution is stable or an artifact of
this particular sample. Three procedures test different dimensions of
stability:

1. **Bootstrap stability** (B=1,000): Subsample 80% of the data, refit GMM,
   and compute ARI against the full-sample solution.
2. **Bootstrap confidence intervals** (B=400): Resample with replacement,
   compute cluster means conditional on full-sample labels, and extract 95% CIs.
3. **5-fold cross-validation**: Partition the data into 5 folds, fit GMM on
   each training set (4 folds), and evaluate silhouette on the held-out fold.

## Bootstrap stability (ARI-based, B=1,000)

```{r bootstrap-stability}
#| label: bootstrap-stability

bootstrap_gmm_stability <- function(X, cl_full, B = 1000, frac = 0.80,
                                     seed = SEED) {
  set.seed(seed)
  n <- nrow(X)
  out <- vector("list", B)

  for (b in seq_len(B)) {
    idx <- sample.int(n, size = floor(frac * n), replace = FALSE)
    g <- try(Mclust(X[idx, , drop = FALSE], G = 4, modelNames = "EEV",
                    verbose = FALSE), silent = TRUE)
    if (inherits(g, "try-error") || is.null(g)) {
      out[[b]] <- tibble(b = b, ok = FALSE, ari = NA_real_,
                         sizes = NA_character_)
      next
    }
    cl_b <- as.integer(g$classification)
    ari_b <- mclust::adjustedRandIndex(cl_b, cl_full[idx])
    out[[b]] <- tibble(b = b, ok = TRUE, ari = ari_b,
                       sizes = fmt_sizes(cl_b))
  }
  bind_rows(out)
}

stab <- bootstrap_gmm_stability(X, cl_k4, B = 1000, frac = 0.80, seed = SEED)

stab_summary <- stab %>%
  summarise(
    B = n(),
    converged = sum(ok),
    failed = sum(!ok),
    mean_ARI = mean(ari, na.rm = TRUE),
    sd_ARI = sd(ari, na.rm = TRUE),
    ci_lo = quantile(ari, 0.025, na.rm = TRUE),
    ci_hi = quantile(ari, 0.975, na.rm = TRUE)
  )

knitr::kable(stab_summary, digits = 3,
             caption = "Bootstrap stability: GMM k=4 subsample refits (80%, B=1,000).")
```

```{r bootstrap-histogram}
#| label: bootstrap-histogram
#| fig-cap: "Distribution of bootstrap ARI values (B=1,000 subsample refits)."
#| fig-width: 5
#| fig-height: 3

ggplot(stab %>% filter(ok), aes(x = ari)) +
  geom_histogram(bins = 40, fill = "steelblue", color = "white") +
  geom_vline(xintercept = stab_summary$mean_ARI, linetype = "dashed",
             color = "firebrick") +
  labs(x = "ARI vs full-sample GMM (k=4)", y = "Count") +
  theme_minimal()
```

## Bootstrap CIs on silhouette and Dunn index

The BAM paper reports bootstrap CIs on silhouette and Dunn. These are
computed separately from the ARI stability analysis above: resample with
replacement, refit GMM, and compute the metric on each bootstrap sample.

```{r bootstrap-metrics}
#| label: bootstrap-metrics

# Dunn index (ratio of min inter-cluster distance to max intra-cluster diameter)
dunn_index <- function(X, cl) {
  cl <- as.integer(factor(cl))
  uq <- sort(unique(cl))
  if (length(uq) < 2) return(NA_real_)

  # Intra-cluster max diameter
  max_intra <- 0
  for (j in uq) {
    members <- X[cl == j, , drop = FALSE]
    if (nrow(members) > 1) {
      d <- as.matrix(dist(members))
      max_intra <- max(max_intra, max(d))
    }
  }
  if (max_intra == 0) return(NA_real_)

  # Min inter-cluster distance (subsample for speed)
  min_inter <- Inf
  for (i in seq_along(uq)) {
    for (j in seq_len(i - 1)) {
      m_i <- X[cl == uq[i], , drop = FALSE]
      m_j <- X[cl == uq[j], , drop = FALSE]
      # Subsample large clusters for speed
      if (nrow(m_i) > 200) m_i <- m_i[sample.int(nrow(m_i), 200), , drop = FALSE]
      if (nrow(m_j) > 200) m_j <- m_j[sample.int(nrow(m_j), 200), , drop = FALSE]
      d_ij <- as.matrix(dist(rbind(m_i, m_j)))
      n_i <- nrow(m_i)
      cross <- d_ij[1:n_i, (n_i + 1):(n_i + nrow(m_j))]
      min_inter <- min(min_inter, min(cross))
    }
  }
  min_inter / max_intra
}

bootstrap_metrics <- function(X, cl_full, B = 400, seed = SEED) {
  set.seed(seed)
  n <- nrow(X)
  results <- vector("list", B)

  for (b in seq_len(B)) {
    idx <- sample.int(n, size = n, replace = TRUE)
    X_b <- X[idx, , drop = FALSE]
    cl_b <- cl_full[idx]

    sil_b <- silhouette_subsample(X_b, cl_b, m = 1500, seed = b)
    dunn_b <- try(dunn_index(X_b, cl_b), silent = TRUE)
    if (inherits(dunn_b, "try-error")) dunn_b <- NA_real_

    results[[b]] <- tibble(b = b, silhouette = sil_b, dunn = dunn_b)
  }
  bind_rows(results)
}

bt_metrics <- bootstrap_metrics(X, cl_k4, B = 400, seed = SEED)

metrics_summary <- bt_metrics %>%
  summarise(
    sil_mean = mean(silhouette, na.rm = TRUE),
    sil_lo   = quantile(silhouette, 0.025, na.rm = TRUE),
    sil_hi   = quantile(silhouette, 0.975, na.rm = TRUE),
    dunn_mean = mean(dunn, na.rm = TRUE),
    dunn_lo   = quantile(dunn, 0.025, na.rm = TRUE),
    dunn_hi   = quantile(dunn, 0.975, na.rm = TRUE)
  )

knitr::kable(metrics_summary, digits = 3,
             caption = "Bootstrap 95% CIs for silhouette and Dunn index (B=400).")
```

## Bootstrap confidence intervals on cluster means (B=400)

```{r bootstrap-ci}
#| label: bootstrap-ci

bootstrap_table_ci <- function(df_complete, cl_full, vars, B = 400,
                                seed = SEED) {
  set.seed(seed)
  n <- nrow(df_complete)
  cl <- factor(cl_full, levels = sort(unique(cl_full)))

  res <- vector("list", B)
  for (b in seq_len(B)) {
    idx <- sample.int(n, size = n, replace = TRUE)
    d_b <- df_complete[idx, , drop = FALSE]
    cl_b <- cl[idx]

    means_b <- d_b %>%
      mutate(cluster = cl_b) %>%
      group_by(cluster) %>%
      summarise(across(all_of(vars), ~ mean(.x, na.rm = TRUE)),
                .groups = "drop") %>%
      pivot_longer(-cluster, names_to = "variable", values_to = "mean") %>%
      mutate(b = b)

    res[[b]] <- means_b
  }
  bind_rows(res)
}

# Prepare raw-scale complete data for bootstrapping
df_complete <- as.data.frame(X_raw[complete_idx, ]) %>%
  mutate(across(everything(), as.numeric))

bt <- bootstrap_table_ci(df_complete, cl_k4, VARS5, B = 400, seed = SEED)

# Orientation label mapping
orientation_map <- c(
  "1" = "Culture Affirming",
  "2" = "Assimilationist",
  "3" = "Demicultural",
  "4" = "Bicultural"
)

bt_ci <- bt %>%
  group_by(cluster, variable) %>%
  summarise(
    lo   = quantile(mean, 0.025, na.rm = TRUE),
    hi   = quantile(mean, 0.975, na.rm = TRUE),
    mean = mean(mean, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(
    orientation = orientation_map[as.character(cluster)]
  ) %>%
  select(orientation, variable, mean, lo, hi) %>%
  arrange(orientation, variable)

knitr::kable(bt_ci, digits = 3,
             caption = "Bootstrap 95% CIs for cluster means on VARS5 (B=400).")
```

## 5-fold cross-validation

```{r kfold-cv}
#| label: kfold-cv

kfold_cv_gmm <- function(X, k_clusters = 4, n_folds = 5, seed = SEED) {
  set.seed(seed)
  n <- nrow(X)
  fold_ids <- sample(rep(1:n_folds, length.out = n))

  results <- vector("list", n_folds)
  for (f in seq_len(n_folds)) {
    train_idx <- which(fold_ids != f)
    test_idx  <- which(fold_ids == f)

    X_train <- X[train_idx, , drop = FALSE]
    X_test  <- X[test_idx, , drop = FALSE]

    # Fit on training data
    g_train <- try(Mclust(X_train, G = k_clusters, modelNames = "EEV",
                          verbose = FALSE), silent = TRUE)

    if (inherits(g_train, "try-error") || is.null(g_train)) {
      results[[f]] <- tibble(fold = f, status = "failed",
                             sil_train = NA_real_, sil_test = NA_real_,
                             ch_train = NA_real_, ch_test = NA_real_)
      next
    }

    cl_train <- as.integer(g_train$classification)

    # Predict on test data using the trained model
    cl_test <- as.integer(predict(g_train, newdata = X_test)$classification)

    # Evaluate on both sets
    sil_train <- silhouette_subsample(X_train, cl_train, m = 1500, seed = f)
    sil_test  <- silhouette_subsample(X_test, cl_test, m = 1500, seed = f + 100)
    ch_train  <- ch_index(X_train, cl_train)
    ch_test   <- ch_index(X_test, cl_test)

    results[[f]] <- tibble(
      fold = f, status = "ok",
      sil_train = sil_train, sil_test = sil_test,
      ch_train = ch_train, ch_test = ch_test
    )
  }
  bind_rows(results)
}

cv_results <- kfold_cv_gmm(X, k_clusters = 4, n_folds = 5, seed = SEED)

knitr::kable(cv_results %>% select(-status), digits = 3,
             caption = "5-fold cross-validation: silhouette and CH index on train/test splits.")
```

```{r cv-summary}
#| label: cv-summary

cv_summary <- cv_results %>%
  filter(status == "ok") %>%
  summarise(
    mean_sil_train = mean(sil_train, na.rm = TRUE),
    mean_sil_test  = mean(sil_test, na.rm = TRUE),
    sd_sil_test    = sd(sil_test, na.rm = TRUE),
    mean_ch_train  = mean(ch_train, na.rm = TRUE),
    mean_ch_test   = mean(ch_test, na.rm = TRUE)
  )

knitr::kable(cv_summary, digits = 3,
             caption = "Cross-validation summary: mean silhouette and CH index across folds.")
```

## Validation summary

```{r validation-summary}
#| label: validation-summary

cat("=== Validation Summary ===\n\n")

cat("1. Bootstrap stability (B=1,000, 80% subsample):\n")
cat("   Mean ARI:", round(stab_summary$mean_ARI, 3), "\n")
cat("   95% CI: [", round(stab_summary$ci_lo, 3), ",",
    round(stab_summary$ci_hi, 3), "]\n")
cat("   Failed iterations:", stab_summary$failed, "of", stab_summary$B, "\n\n")

cat("2. Bootstrap CIs on cluster means (B=400):\n")
cat("   All 95% CIs computed; see table above.\n\n")

cat("3. 5-fold cross-validation:\n")
cat("   Mean test silhouette:", round(cv_summary$mean_sil_test, 3), "\n")
cat("   SD test silhouette:", round(cv_summary$sd_sil_test, 3), "\n")
cat("   Mean test CH index:", round(cv_summary$mean_ch_test, 2), "\n")
```

\newpage

# Save Final Clustered Dataset

```{r save-output}
#| label: save-output

out_dir <- here("data", "processed")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

bam_clustered <- df_out
out_file <- here("data", "processed", "bam_clustered_gmm_k4.rda")
save(bam_clustered, file = out_file)

cat("Saved:", out_file, "\n")
cat("Object: bam_clustered (", nrow(bam_clustered), "x",
    ncol(bam_clustered), ")\n")
cat("Columns added: cluster_gmm_k4, orientation\n")

# Save all validation results for Manuscript and Appendix QMDs
# This avoids re-running the 15+ minute bootstrap downstream
validation_results <- list(
  comp_table      = comp_table,
  ari_table       = ari_table,
  stab            = stab,
  stab_summary    = stab_summary,
  bt_ci           = bt_ci,
  bt_metrics      = bt_metrics,
  metrics_summary = metrics_summary,
  cv_results      = cv_results,
  cv_summary      = cv_summary,
  gmm_k4_bic     = gmm_k4$bic,
  gmm_k4_loglik  = gmm_k4$loglik,
  table_means     = table_means,
  z_means         = z_means,
  orientation_map = orientation_map,
  VARS5           = VARS5,
  POL_VARS        = POL_VARS,
  SEED            = SEED
)
val_file <- here("data", "processed", "phase2_validation_results.rda")
save(validation_results, file = val_file)
cat("Saved validation results:", val_file, "\n")
```

# Checklist

- [ ] GMM k=4 sizes should be approximately **433, 705, 378, 3269** (label
  order may differ across platforms; verify against cluster means table).
- [ ] Orientation labels map correctly to theoretical profiles.
- [ ] Bootstrap mean ARI should be high (>0.80), indicating stable recovery.
- [ ] Cross-validation silhouette scores should be consistent across folds.
- [ ] `data/processed/bam_clustered_gmm_k4.rda` exists and contains the
  `orientation` column.

# Session Info

```{r session-info}
#| label: session-info

sessionInfo()
```
